image.png

Bioframe enables flexible and scalable operations on Pandas dataframes of genomic intervals.¶

A sequenced genome provides a common coordinate system for locating features of interest.¶

image.png

image.png

Image from Fantastic Genes and Where to Find Them by Maria Gallegos

image.png

Positions and intervals provide addresses for anything that can be mapped to a reference genome:

  • Alignments of short sequencing reads
  • Sites of genetic variation
  • Individual genes and their substructual components
  • Locations occupied by DNA-binding proteins in particular cell types
  • And much more!

It is no wonder that positions and intervals are key fields in most genomic data files.

For example, next-generation sequencing (NGS) technologies can map all kinds of functional biomolecular information along this coordinate system.

image.png

Show a grey line - genome - with some genes, orange intervals - like on the figure below Then show another track - let’s say H3K4me3 (just say that that you can do experiment where you can measure gene activity or something like that) And then pose a question - we would like to find a subset of active genes in this cell line or whatever Therefore - OVERLAP ! Then you say that this is the most basic/fundamental operation in genomics

Why Bioframe?¶

Genomic intervals are one of the most prevalent data structures in computational genome biology.

Interval arithmetic provides a language for asking questions about relationships between features.

There are excellent and widely used command-line tools, such as BEDTools for genome interval arithmetic on common text file formats (e.g. BED).

image.png

There is also a Python interface pyBedTools. However, it roundtrips data into and out of the command line program through text file intermediates, which is inefficient and places unnecessary limitations.

We sought to create a flexible, Pythonic implementation built entirely with numpy and pandas, with an API smoothly integrated with the Python data ecosystem.

In our implementation, genomic interval sets are simply dataframes.

No need to introduce higher-level data structures.

BED frames¶

The core objects in bioframe are pandas DataFrames of genomic intervals, or BedFrames defined by chrom, start and end columns (the names are configurable). We don't use opaque wrapper objects or DataFrame subclasses.

In [34]:
import bioframe as bf

df1 = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 3, 8],
    ['chr1', 8, 10],
    ['chr1', 12, 14]],
    columns=['chrom', 'start', 'end']
)

display(df1)

bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
plt.title('bedFrame1 intervals');
chrom start end
0 chr1 1 5
1 chr1 3 8
2 chr1 8 10
3 chr1 12 14
In [35]:
df2 = bf.from_any(
    [['chr1', 4, 8],
     ['chr1', 10, 11]], 
    name_col='chrom'
)

display(df2)

bf.vis.plot_intervals(df2, show_coords=True, xlim=(0,16), colors='lightpink')
plt.title('bedFrame2 intervals');
chrom start end
0 chr1 4 8
1 chr1 10 11

Interval operations¶

Overlap¶

The most common operation is to calculate the overlaps between two sets of genomic intervals.

In [30]:
overlaps = bf.overlap(df1, df2, how='inner', suffixes=('_1','_2'))
overlaps
Out[30]:
chrom_1 start_1 end_1 chrom_2 start_2 end_2
0 chr1 1 5 chr1 4 8
1 chr1 3 8 chr1 4 8

Cluster¶

In [36]:
df1 = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 3, 8],
    ['chr1', 8, 10],
    ['chr1', 12, 14],
    ],
    columns=['chrom', 'start', 'end']
)

bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
In [37]:
df_annotated = bf.cluster(df1, min_dist=0)
display(df_annotated)
bf.vis.plot_intervals(df_annotated, labels=df_annotated['cluster'], show_coords=True, xlim=(0,16))
chrom start end cluster cluster_start cluster_end
0 chr1 1 5 0 1 10
1 chr1 3 8 0 1 10
2 chr1 8 10 0 1 10
3 chr1 12 14 1 12 14

Merge¶

Instead of returning cluster assignments,bioframe.merge returns a new dataframe of merged genomic intervals. As with bioframe.cluster, using min_dist=0 and min_dist=None gives different results.

If min_dist=0, this returns a dataframe of two intervals:

In [38]:
df_merged = bf.merge(df1, min_dist=0)
display(df_merged)
bf.vis.plot_intervals(df_merged, show_coords=True, xlim=(0,16))
chrom start end n_intervals
0 chr1 1 10 3
1 chr1 12 14 1

If min_dist=None, this returns a dataframe of three intervals:

In [39]:
df_merged = bf.merge(df1, min_dist=None)
display(df_merged)
bf.vis.plot_intervals(df_merged, show_coords=True, xlim=(0,16))
chrom start end n_intervals
0 chr1 1 8 2
1 chr1 8 10 1
2 chr1 12 14 1

Closest¶

It is often useful not only to find features that overlap, but also features that are nearby along the genome.

In [42]:
closest_intervals = bf.closest(df1, df2, suffixes=('_1','_2'))
display(closest_intervals)
chrom_1 start_1 end_1 chrom_2 start_2 end_2 distance
0 chr1 1 5 chr1 4 8 0
1 chr1 3 8 chr1 4 8 0
2 chr1 8 10 chr1 4 8 0
3 chr1 12 14 chr1 10 11 1

By default, bioframe.closest reports overlapping intervals. This can be modified by passing ignore_overlap=True. Note the closest pair #2 and #3, which did not overlap, remain the same.

Closest intervals within a single DataFrame can be found simply by passing a single dataframe to bioframe.closest. The number of closest intervals to report per query interval can be adjusted with k.

In [45]:
bf.closest(df1, k=2)
Out[45]:
chrom start end chrom_ start_ end_ distance
0 chr1 1 5 chr1 3 8 0
1 chr1 1 5 chr1 8 10 3
2 chr1 3 8 chr1 1 5 0
3 chr1 3 8 chr1 8 10 0
4 chr1 8 10 chr1 3 8 0
5 chr1 8 10 chr1 12 14 2
6 chr1 12 14 chr1 8 10 2
7 chr1 12 14 chr1 3 8 4

Subtraction and set difference¶

Bioframe has two functions for computing differences between sets of intervals: at the level of basepairs and at the level of whole intervals.

Basepair-level subtraction is performed with bioframe.subtract:

In [49]:
subtracted_intervals = bf.subtract(df1, df2)
display(subtracted_intervals)
chrom start end
0 chr1 1 4
1 chr1 3 4
2 chr1 8 10
3 chr1 12 14
In [50]:
bf.vis.plot_intervals(subtracted_intervals, show_coords=True, xlim=(0,16))

Interval-level differences are calculated with bioframe.setdiff:

In [51]:
setdiff_intervals = bf.setdiff(df1, df2)
display(setdiff_intervals)
chrom start end
2 chr1 8 10
3 chr1 12 14
In [52]:
bf.vis.plot_intervals(setdiff_intervals, show_coords=True, xlim=(0,16))

Convenience: coverage and counting overlaps¶

For two sets of genomic features, it is often useful to calculate the number of basepairs covered and the number of overlapping intervals. While these are fairly straightforward to compute from the output of bioframe.overlap with pandas.groupby and column renaming, since these are very frequently used, they are provided as core bioframe functions.

In [46]:
df1_coverage = bf.coverage(df1, df2)
display(df1_coverage)
chrom start end coverage
0 chr1 1 5 1
1 chr1 3 8 4
2 chr1 8 10 0
3 chr1 12 14 0
In [47]:
df1_coverage_and_count = bf.count_overlaps(df1_coverage, df2)
display(df1_coverage_and_count)
chrom start end coverage count
0 chr1 1 5 1 1
1 chr1 3 8 4 1
2 chr1 8 10 0 0
3 chr1 12 14 0 0

Complement¶

Equally important to finding which genomic features overlap is finding those that do not. bioframe.complement returns a BedFrame of intervals not covered by any intervals in an input BedFrame.

Complments are defined relative to a genomic view. Here this is provided as a dictionary with a single chromosome of length 15:

In [53]:
df_complemented = bf.complement(df1, view_df={'chr1':15})
display(df_complemented)
chrom start end view_region
0 chr1 0 1 chr1
1 chr1 10 12 chr1
2 chr1 14 15 chr1
In [54]:
bf.vis.plot_intervals(df_complemented, show_coords=True, xlim=(0,16), colors='lightpink')

If no view is provided, complement is calculated per unique chromosome in the input with right limits of np.iinfo(np.int64).max.

In [55]:
df_complemented = bf.complement(df1)
display(df_complemented)
chrom start end view_region
0 chr1 0 1 chr1
1 chr1 10 12 chr1
2 chr1 14 9223372036854775807 chr1

Other ops¶

  • Expand and trim
  • Sorting
  • Selecting and slicing
In [61]:
view_df = pd.DataFrame(
    [
        ["chr1", 0, 4, "chr1p"],
        ["chr1", 5, 9, "chr1q"],
        ["chrX", 0, 50, "chrX"],
        ["chrM", 0, 10, "chrM"]],
    columns=["chrom", "start", "end", "name"],
)
In [62]:
df_unsorted = pd.DataFrame([
    ['chrM', 3, 8],
    ['chrM', 1, 5],
    ['chrX', 12, 14],
    ['chrX', 8, 10]],
    columns=['chrom', 'start', 'end']
)

bf.sort_bedframe(df_unsorted)
Out[62]:
chrom start end
0 chrM 1 5
1 chrM 3 8
2 chrX 8 10
3 chrX 12 14
In [63]:
bf.sort_bedframe(df_unsorted, view_df)
Out[63]:
chrom start end
0 chrX 8 10
1 chrX 12 14
2 chrM 1 5
3 chrM 3 8
In [64]:
bioframe.select(df_unsorted,'chrX:8-14')
Out[64]:
chrom start end
2 chrX 12 14
3 chrX 8 10
In [ ]:
#import hg
#hg.Viewconf.from_url("https://resgen.io/api/v1/viewconfs/Y_omIrpERgG01VsqmtMLVA/?raw=1")

File I/O and other utilities¶

Performance¶

Figure from paper

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Example workflow¶

In [ ]:
ctcf_peaks = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz", 
    schema='narrowPeak'
)
In [ ]:
ctcf_peaks
Out[ ]:
chrom start end name score strand fc -log10p -log10q relSummit
0 chr19 48309541 48309911 . 1000 . 5.04924 -1.0 0.00438 185
1 chr4 130563716 130564086 . 993 . 5.05052 -1.0 0.00432 185
2 chr1 200622507 200622877 . 591 . 5.05489 -1.0 0.00400 185
3 chr5 112848447 112848817 . 869 . 5.05841 -1.0 0.00441 185
4 chr1 145960616 145960986 . 575 . 5.05955 -1.0 0.00439 185
... ... ... ... ... ... ... ... ... ... ...
40582 chr8 22574315 22574744 . 1000 . 561.11939 -1.0 4.90268 243
40583 chr15 56246029 56246402 . 1000 . 569.05663 -1.0 4.90268 191
40584 chr1 150979463 150979845 . 1000 . 580.28482 -1.0 4.90268 194
40585 chr16 57649040 57649402 . 1000 . 602.95266 -1.0 4.90268 173
40586 chr12 54379625 54380042 . 1000 . 627.60723 -1.0 4.90268 203

40587 rows × 10 columns

In [ ]:
jaspar_url = 'http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/'
jaspar_motif_file = 'MA0139.1.tsv.gz'
ctcf_motifs = bioframe.read_table(
    jaspar_url + jaspar_motif_file,
    schema='jaspar',
    skiprows=1
)
In [ ]:
ctcf_motifs
Out[ ]:
chrom start end name score pval strand
0 chr1 11163 11182 CTCF 811 406 -
1 chr1 11222 11241 CTCF 959 804 -
2 chr1 11280 11299 CTCF 939 728 -
3 chr1 11339 11358 CTCF 837 455 -
4 chr1 11401 11420 CTCF 829 439 -
... ... ... ... ... ... ... ...
770044 chrY_KI270740v1_random 4600 4619 CTCF 815 412 -
770045 chrY_KI270740v1_random 5601 5620 CTCF 815 412 -
770046 chrY_KI270740v1_random 7601 7620 CTCF 815 412 -
770047 chrY_KI270740v1_random 8602 8621 CTCF 815 412 -
770048 chrY_KI270740v1_random 19753 19772 CTCF 806 395 -

770049 rows × 7 columns

In [ ]:
df = bioframe.overlap(
    ctcf_peaks, 
    ctcf_motifs, 
    # suffixes=('_1', ''), 
    return_index=True,
    how='left',
)
In [ ]:
motifs_per_peak = df.groupby(["index"])["index_"].count().values

bins = np.arange(0, np.max(motifs_per_peak))
counts, _ = np.histogram(motifs_per_peak, bins)
plt.bar(bins[:-1], height=counts, align='center')
plt.xlabel('number of overlapping motifs per peak')
plt.ylabel('number of peaks')
plt.semilogy();

print(f'fraction of peaks without motifs {np.round(np.sum(motifs_per_peak==0)/len(motifs_per_peak),2)}');
fraction of peaks without motifs 0.14
In [ ]:
df
Out[ ]:
index chrom start end name score strand fc -log10p -log10q relSummit index_ chrom_ start_ end_ name_ score_ pval_ strand_
0 0 chr19 48309541 48309911 . 1000.0 . 5.04924 -1.0 0.00438 185.0 <NA> None <NA> <NA> None NaN NaN None
1 1 chr4 130563716 130564086 . 993.0 . 5.05052 -1.0 0.00432 185.0 <NA> None <NA> <NA> None NaN NaN None
2 2 chr1 200622507 200622877 . 591.0 . 5.05489 -1.0 0.00400 185.0 <NA> None <NA> <NA> None NaN NaN None
3 3 chr5 112848447 112848817 . 869.0 . 5.05841 -1.0 0.00441 185.0 <NA> None <NA> <NA> None NaN NaN None
4 4 chr1 145960616 145960986 . 575.0 . 5.05955 -1.0 0.00439 185.0 <NA> None <NA> <NA> None NaN NaN None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48626 40584 chr1 150979463 150979845 . 1000.0 . 580.28482 -1.0 4.90268 194.0 40407 chr1 150979668 150979687 CTCF 925.0 678.0 -
48627 40585 chr16 57649040 57649402 . 1000.0 . 602.95266 -1.0 4.90268 173.0 261215 chr16 57649185 57649204 CTCF 918.0 656.0 -
48628 40585 chr16 57649040 57649402 . 1000.0 . 602.95266 -1.0 4.90268 173.0 261216 chr16 57649229 57649248 CTCF 829.0 439.0 +
48629 40586 chr12 54379625 54380042 . 1000.0 . 627.60723 -1.0 4.90268 203.0 153951 chr12 54379782 54379801 CTCF 863.0 511.0 -
48630 40586 chr12 54379625 54380042 . 1000.0 . 627.60723 -1.0 4.90268 203.0 153952 chr12 54379835 54379854 CTCF 885.0 564.0 -

48631 rows × 19 columns

In [ ]:
%%bash

echo "Hello World"
which sed
Hello World
/bin/sed

Bioframe for application programming¶

In [1]:
import hg

config = hg.Viewconf.from_url('https://higlass.io/api/v1/viewconf?d=default')

config
Out[1]:

For more information¶

Bioframe docs

Open2C

In [ ]: